**Engineering** **a** **Multi-core** **Radix** **Sort**

Jan Wassenberg1 and Peter Sanders2

2

1 Fraunhofer IOSB, Ettlingen, Germany

jan.wassenberg@iosb.fraunhofer.de

Karlsruhe Institute of Technology, Karlsruhe, Germany

sanders@kit.edu

**Abstract.** We present a fast radix sorting algorithm that builds upon a microarchitecture-aware variant of counting sort. Taking advantage of virtual memory and making use of write-combining yields a per-pass throughput corresponding to at least 89% of the system’s peak memory bandwidth. Our implementation outperforms Intel’s recently published radix sort by a factor of 1.64. It also compares favorably to the reported performance of an algorithm for Fermi GPUs when data-transfer over- head is included. These results indicate that scalar, bandwidth-sensitive sorting algorithms remain competitive on current architectures. Various other memory-intensive applications can beneﬁt from the techniques de- scribed herein.

**1** **Introduction**

Sorting is a fundamental operation that is a time-critical component of various applications such as databases and search engines. The well-known lower bound of Ω (N · log N) for comparison-based algorithms no longer applies when special properties of the keys can be assumed. In this work, we focus on 32-bit integer keys, optionally paired with a 32-bit (or larger) value. This simpliﬁes the imple- mentation without loss of generality, since applications can often replace large records with a pointer or index [[1].](#_bookmark1) The radix sort algorithm is commonly used in such cases due to its O (n) complexity. In this report, we show a 1.64-fold performance increase over results recently published by Intel [[2].](#_bookmark2)

The remaining sections are organized in a bottom-up fashion, with Section [2](#_bookmark3) dedicated to the basic realities of current and future microarchitectures that af- fect memory-intensive programs and motivate our approach. We build upon this foundation in Section [3](#_bookmark4), showing how to speed up counting sort by taking ad- vantage of virtual memory and write-combining. Section [4](#_bookmark5) applies this technique towards a novel variant of radix sort. The performance of our implementation is evaluated in Section [5](#_bookmark6). Bandwidth measurements indicate the per-pass through- put is nearly optimal for the given hardware. Its two CPUs outperform a Fermi GPU when accounting for data-transfer overhead.

E. Jeannot, R. Namyst, and J. Roman (Eds.): Euro-Par 2011, LNCS 6853, Part II, pp. 160[–169,](#_bookmark7)2011. ![](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAoAAAAKCAYAAACNMs+9AAAAjUlEQVQYlX3PvwpBYRzG8U8nJYnBZnULZ7VZDHa5BFcjZZcbUMpsMyodk1yAlBtgYfC+OZ1O71PP8vy+v3/81cUa5+AVmiqa4Y5hKRvjgVEMBnijUe0OEz/owQ55DRQ1wVK4p5UAOzhmeKGdAPt4xtHbBHhAnmHv98yiAmTY4IJTuTDHFUXwDdPEpnp9AZtVGA0qn8IXAAAAAElFTkSuQmCC)c Springer-Verlag Berlin Heidelberg 2011

Engineering a Multi-core Radix Sort 161

**2** **Software** **Write-Combining**

We begin with a description of basic microarchitectural realities that are likely to have a serious impact on applications with numerous memory accesses, and show how to avoid performance penalties by means of Software Write-Combining. These topics are not new, but we believe they are often not adequately addressed. The ﬁrst problem arises when writing items to multiple streams. An ideal cache with at least as many lines could exploit the writes’ spatial locality and entirely avoid noncompulsory misses. However, perfect hit rates are not achiev- able in practice due to limited ways of associativity a [[3].](#_bookmark8) Since only a lines can be mapped to a cache set, any further allocations from that set result in the eviction of one of the previous lines. If possible, applications should avoid writing to many diﬀerent streams. Otherwise, the various write positions should map to diﬀerent sets to avoid thrashing and conﬂict misses. For current L1 caches with a = 8 ways, size C = 32 KiB and lines of B = 64 bytes, there are

S = C![](data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL/2wBDAQkJCQwLDBgNDRgyIRwhMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjL/wAARCAACABIDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QAHwEAAwEBAQEBAQEBAQAAAAAAAAECAwQFBgcICQoL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSExBhJBUQdhcRMiMoEIFEKRobHBCSMzUvAVYnLRChYkNOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6goOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4uPk5ebn6Onq8vP09fb3+Pn6/9oADAMBAAIRAxEAPwD5/ooooAKKKKAP/9k=)a·B = 64 sets, and bits [lg B, lg B +lg S) of the destination addresses should diﬀer (e.g. by ensuring the write positions are not a multiple of S · B = 4 KiB apart).

A second issue is provoked by a large number of write-only accesses. Even if an entire cache line is to be written, the previous destination memory must ﬁrst be read into the cache. While the corresponding latency may be partially hidden via prefetching, the cache line allocations remain problematic due to capacity constraints and eviction policy. Instead of displacing write-only lines that are not accessed after having been ﬁlled, the widespread (pseudo-)Least-Recently- Used strategy displaces previously cached data due to their older timestamp. An attempt to avoid these evictions by explicitly invalidating cache lines (e.g. with the IA-32 CLFLUSH instruction) did not yield meaningful improvements. Instead, applications should use non-temporal streaming store instructions that write directly to memory. These are guaranteed to avoid cache pollution since they circumvent the cache.

This leads directly to the next concern: single memory accesses involve sig- niﬁcant bus overhead. The architecture therefore combines neighboring non- temporal writes into a single burst transfer. However, currently microarchitec- tures only provide four to ten write-combine (WC) buﬀers [[4].](#_bookmark9) Non-temporal writes to multiple streams may force these buﬀers to be ﬂushed to memory via ‘partial writes’ before they are full. The application can prevent this by making use of Software Write-Combining [[5].](#_bookmark10) The data to be written is ﬁrst placed into temporary buﬀers, which almost certainly reside in the cache because they are frequently accessed. When full, a buﬀer is copied to the actual destination via consecutive non-temporal writes, which are guaranteed to be combined into a single burst transfer.

This scheme avoids reading the destination memory, which may incur rela- tively expensive Read-For-Ownership transactions and would only pollute the cache. It works around the limited number of WC buﬀers by using L1 cache lines for that purpose. Interestingly, this is tantamount to direct software control of the transparently managed cache.

162 J. Wassenberg and P. Sanders

We recommend the use of such Software Write-Combining whenever a core’s active write destinations outnumber its write-combine buﬀers. Fortunately, this can be done at a fairly high level, since only the buﬀer copying requires special vector loads and non-temporal stores (which are best expressed by the SSE2 intrinsics built into the major compilers).

**3** **Virtual-Memory** **Counting** **Sort**

We now review Counting Sort of N elements with keys in [0,M) and describe an improved variant that makes use of virtual memory and write-combining.

The na![](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAYAAAARCAYAAAD+H91rAAAAWUlEQVQYlb3OMQ5AUBBF0YNEInZlFXah1tiZ2hYUFqDAEtD83xDRiJu8mWRu8jL8w4IKNSZkwkjRY8WG4bUqQYsRxVXmmFHGQxr2HuIqbjyKKBsc6F5f/ZATKOIMDkyIEVgAAAAASUVORK5CYII=)ve algorithm ﬁrst generates a histogram of the N keys. After comput- ing the preﬁx sum to yield the starting output location for each key, each value is written at its key’s output position, which is subsequently incremented.

Our ﬁrst optimization goal is to avoid the initial counting pass. We could instead insert each value into a per-key container, e.g. a list of data blocks. However, this incurs some overhead for checking whether the current bucket is full. Preallocating space for M arrays of size N is more eﬃcient, because items can simply be written to the next free position (c.f. Algorithm [1](#_bookmark11), introduced in [[6]](#_bookmark12)). This algorithm only writes and reads each item once, a feat that comes at

|  |
| --- |
| **Algorithm** **1**: Single-pass counting sort |
| storage := ReserveAddressSpace(N · M);    **for** i := 0 **to** M − 1 **do** next [i] := i · N;  **foreach** key,value **do**  storage [next [key]] := value; next [key] := next [key] + 1; |

the price of N · M space. While this appears problematic in the Random-Access- Machine model, it is easily handled by 64-bit CPUs with paged virtual memory. Physical memory is only mapped to pages when they are ﬁrst accessed,[1](#_bookmark13) thus reducing the actual memory requirements to O (N +M · pageSize). The remainder of the initial allocation only occupies address space, of which multiple terabytes are available on 64-bit systems.

Having avoided the initial counting pass, we now show how to eﬃciently write values to storage using the write-combining technique described in Section[2](#_bookmark3). Our implementation initializes the next pointers to consecutive, naturally aligned, cache-line-sized buﬀers. A buﬀer is full when its (post-incremented) position is evenly divisible by its size. When that happens, an unrolled loop of non-temporal writes copies the buﬀer to its key’s current output position within storage. These output positions are also stored in an array of pointers.

1 Accesses to non-present pages result in a page fault exception. The application re- ceives such events via signals (POSIX) or Vectored Exception Handling (Microsoft Windows) and reacts by committing memory, after which the faulting instruction is repeated.

Engineering a Multi-core Radix Sort 163

**4** **Radix** **Sort**

After a brief review of radix sorting, we introduce a new variant based on the

virtual-memory counting sort described in Section [3](#_bookmark4).

A radix sort successively examines D-bit ‘digits’ of the K-bit keys. They are characterized by the order in which digits are processed: starting at the Least Signiﬁcant Digit (LSD), or Most Signiﬁcant Digit (MSD).

An MSD radix sort partitions the items according to the current digit, then recursively sorts the resulting buckets. While it no longer needs to move items whose previously seen key digits are unique, this is not especially helpful when the number of passes K/D is small. In fact, the overhead of managing numerous (nearly empty) buckets makes MSD radix sort less suited for relatively small N .

By contrast, each iteration of the LSD variant partitions all items into buckets by the current key digit. This amortizes the bucket setup cost over the number of elements and avoids the possibility of load imbalance for parallelization at the price of increased data copying.

To reduce this overhead and also parallel communication, we make use of “reverse sorting” [[7],](#_bookmark14) in which one or more MSD passes partition the data into buckets, which are then locally sorted via LSD. This turns out to be even more advantageous for Non-Uniform Memory Access (NUMA) systems because each processor is responsible for writing a contiguous range of outputs, thus ensuring the OS allocates those pages from the processor’s NUMA node [[8].](#_bookmark15)

Let us now examine the pseudocode of the radix sort (Algorithm [2](#_bookmark16)), choosing K = 32 for brevity and D = 8 to allow extracting key digits without masking. Each Processing Element (PE) ﬁrst uses counting sort to partition its items into local buckets by the MSD (digit = 3). Note that items consist of a key and value, which are adjacent in memory (ideally within a native 64-bit word, but larger combinations are possible in our implementation via larger user-deﬁned types). After all are ﬁnished, the output index of the ﬁrst item of a given MSD is computed via preﬁx sum. Each PE is assigned a range of MSD values, sorting

the buckets from all PEs for each value. Skewed MSD distributions can cause

load imbalance. However, this could be resolved via special treatment of large buckets[2](#_bookmark17). The local sort entails K/D − 1 iterations in LSD order. The ﬁrst copies all other PEs’ buckets into local memory. The second to last pass also computes the last digit’s histogram, thus allowing writing directly to the output positions in the ﬁnal pass. Note that three sets of buckets are required, which makes heavy use of virtual memory (3 · 2D · |PE| = 6144 times the input size). While 64-bit Linux grants each process 128 TiB address space, Windows limits this to 8 TiB, which means only about 1.4 GiB of inputs can be sorted[3](#_bookmark18) .

We brieﬂy discuss additional system-speciﬁc considerations. The radix 2D was motivated by easy access to each digit, but is also limited by the cache

2 Sorting buckets larger than N/|PE| using multiple PEs.

3 This limitation could be circumvented by estimating bounds for bucket sizes via sampling. In the unlikely case that they are exceeded, a new sample would be drawn and the process repeated.

164 J. Wassenberg and P. Sanders

|  |
| --- |
| **Algorithm** **2**: Parallel Radix Sort |
| **parallel** **foreach** item **do**              d := Digit(item, 3); buckets3[d] := buckets3[d] ∪{item}; Barrier ;  **foreach** i ∈ 0, 2D  **do**  bucketSizes[i] :=  PE |buckets3[i]|;  outputIndices := PrefixSum(bucketSizes);  **parallel** **foreach** bucket3 ∈ buckets3 **do**  **foreach** item ∈ bucket3∀ PE **do**  d := Digit(item, 0);  buckets0[d] := buckets0[d] ∪{item};  **foreach** bucket0 ∈ buckets0 **do**  **foreach** item ∈ bucket0 **do**  d := Digit(item, 1); buckets1[d] := buckets1[d] ∪ {item}; d := Digit(item, 2); histogram2 [d] := histogram2 [d]+ 1; **foreach** bucket1 ∈ buckets1 **do**  **foreach** item ∈ bucket1 **do**  d := Digit(item, 2); i := outputIndices[d]+ histogram2 [d]; histogram2 [d] := histogram2 [d]+ 1; output[i] := item; |

and TLB size. Because of the many required TLB entries, we map the buckets with small pages, for which the Intel i7 microarchitecture has 512 second-level

TLB entries. To increase TLB coverage, we use large pages for the inputs. The working set consists of 2D buﬀers, buﬀer pointers, output positions, and 32-bit histogram counters. This ﬁts in a 32 KiB L1 data cache if the software write- combine buﬀers are limited to a single 64-byte cache line. To avoid associativity and aliasing conﬂicts, these arrays are contiguous in memory. Interestingly, these optimizations do not detract from the readability of the source code. Knowledge of the microarchitecture can also be applied towards middle-level languages and enables principled design decisions.

**5** **Performance** **Evaluation**

We characterize the performance of our sorting implementation by its through-

put, deﬁned as ![](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABkAAAASCAYAAACuLnWgAAABH0lEQVQ4jb3UTytEURgG8N8YktQsyCApFnaztrCXjYUUhYW9DUv5Bj6FmrL0Z+MDSDZWxE5RrCwUVgqxOHe6d6Zp3HvNeOrUees873PO8573LUqPSRzgAzdRXMUrbjPkaYlu7OINPShiMw2xmEFkDN/oRwn3GMV1lpv+hhUMYQIvWMZ4OwVgQ/zyE5wLFrYNS7hCOYqnsN9OgT+jIBSzo+jqtEAjhmX70rn4F+LC5kFLfgGrQm0WciRPzZ/GYSKu4DJxszV8Rolq6z0618jfic7voY+4meZxLPj6JAzAR7HHZ1hX7/kX7prwtzCCOSxivyZSEubQDI6avPRB6+ZL8gcECwlzrg69iX0Fz9iW/sfV+FXM4hSDKbm5UPbfPfgDqT8wRJkdzU4AAAAASUVORK5CYII=) , where N is the number of items and t0 and t1 are the ear-

liest and latest start and ﬁnish times reported by any thread. The test platform consists of dual W5580 CPUs (3.2 GHz, 48 GiB DDR3-1066 memory) running Windows XP x64. Our implementation is compiled with ICC 11.1.082 /Ox /Og /Oi /Ot /Qipo /GA /GR- /GS- /EHsc /Qopenmp /QaxSSE4.2. When sorting 350 M

Engineering a Multi-core Radix Sort 165

uniformly distributed 32-bit keys generated by the WELL512 algorithm [[9],](#_bookmark19) the basic algorithm (‘VM only’) reaches a throughput of 391 M items/s, as shown in the second column of Table [1](#_bookmark20). After enabling write-combining (‘VM+WC’), performance nearly doubles to 657 M/s.

Intel has reported 240 M/s for the same task and a single but identical CPU [[2].](#_bookmark2) For a fair comparison with our dual-CPU system, we double their through- put, which optimistically assumes their algorithm is NUMA-aware, scales per- fectly and is not running at a lower memory clock (since our DDR3-1066 is at the lower end of currently available frequencies). We must also divide by the given speedup of 1.2 due to hyperthreads, since those are disabled on our ma- chine. This (‘Intel x2’) yields 400 M/s; the proposed algorithm is therefore 1.64 times as fast. A separate publication has also presented results [[10]](#_bookmark21) for the Many

Integrated Cores architecture. The Knights Ferry processor provides 32 cores, each with 4 threads and 16-wide SIMD. The simulation (‘KNF MIC’) shows a throughput of 560 M/s. Our scalar implementation is currently 1.17 times as fast when running on 8 cores.

Recently, a throughput of 1005 M/s was reported on a GTX 480 (Fermi) GPU [[11].](#_bookmark22) However, this excludes driver and data-transfer overhead. For applications in which the data is generated and consumed by the CPU, we must include at least the time required to read and write data over the PCIe 2.0 bus. Assuming the peak per-direction bandwidth of 8 GB/s is reached, the aggregate throughput (‘GPU+PCIe’) is 501 M/s. Our implementation, running on two CPUs, therefore outperforms this algorithm on a current top-of-the-line GPU by a factor of 1.31 despite lower transistor counts (2 · 731 M vs. 3000 M) and thermal design power (2 · 130 W vs. 275 − 300 W).

**Table** **1.** Throughputs [million items per second] for 32-bit keys and optional 32-bit values

|  |  |  |
| --- | --- | --- |
| Algorithm | K=32,V=0 | K=32,V=32 |
| VM only | 391 | 238 |
| Intel x2 | 400 | 307 |
| GPU+PCIe | 501 | 303 |
| KNF MIC | 560 | (?) |
| VM+WC | 657 | 452 |

Similar measurements and extrapolations for the case of 32-bit keys associated with V = 32-bit values are given in the third column of Table [1](#_bookmark20). Since the slowdown is less than a factor of two, the implementations are at least partially limited by computation instead of bandwidth. Intel’s algorithm is more eﬃcient in this regard, with only a 1.3-fold decrease vs. our factor of 1.45. The additional data transfers over PCIe render the GPU algorithm uncompetitive.

![](data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL/2wBDAQkJCQwLDBgNDRgyIRwhMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjL/wAARCAABAE0DASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QAHwEAAwEBAQEBAQEBAQAAAAAAAAECAwQFBgcICQoL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSExBhJBUQdhcRMiMoEIFEKRobHBCSMzUvAVYnLRChYkNOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6goOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4uPk5ebn6Onq8vP09fb3+Pn6/9oADAMBAAIRAxEAPwD5/ooooAKKKKACiiigAooooAKKKKAP/9k=)![](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAPMAAAC+CAYAAADgMDmNAAAIp0lEQVR4nO3daa9eVRmH8eucTrTQUlpaxpZKtYAIBUsYyyRTGCIQQAkSCRAFQsCmRLCKBAgggyKiBgMmkEBAjCIaQAEBGWTGULHMU4EyfQbN3xfPiq9MDAubm63X7wv0etM8e++z1n2DpP8B2bK6QNLHknHIZZAXJlanSOqVGcDNwFRgj+IYSX3yacjzkJ9A/FGWhin7Q96HfL26RFK3nAl5F7JndYmkLpkMuQ7yLGSL6hpJXTIX8jDk15B1q2skdckOkDch50PGqmskdcnRkA8hR1WXSOqSMciFkDcgi6trJHXJupDb2jvynOoaSV2yALKyfbWeVF0jqUv2hrwHOaO6RFK3nNpOdO1XXSKpSyZCroH8DbKwukZSl8yGPAD5HWR6dY2kLtkO8hrkktF9ZEkDlMMhH0COrS6R1C3nQt6CLKkukdQl0yC3Qh6DbFxdI6lL5kGegdwAmVJdI6lLdoesgSyvLpHULSe2D10HVZdI6pIJkKsgL0K2qq6R1CUbQO6B/B4ys7pGUpdsDXkZ8oPRr7OkAcoh7f34hOoSSd1yNuQdyK7VJZK6ZAbkl5CnIJtV10jqkiWQV9tqGA+CSMOUM9v7sRMzpWHKzDZo70nIp6prJHXJzpDXIT8crYmRNEBZ3uZzHV5dIqlLZrWRPo9B5lfXSOqS3SGrIVc4v1oapIxBzmmP1YdW10jqkg0hd7W1MJtX10jqkj3bbK5LRrOsJQ1MxiHfgbzrEAFpsDK33T3+E2TT6hpJXbJvu+l0oXePpUHKOOT8NmTPJW3SMGVjyP2Q+5xdLQ1WDmi/xue520kapEyAXAR5e7TMXNIAZTPIg5C7IXOqayR1ycGQ9yArRkc0JQ1MJkIua6e5llbXSOqSeZA/Q+6EzK6ukdQlh7WbTt/0sVoapExqGyTehOxWXSOpSxZAHof8drTjSdIA5Yg27nZZdYmkLhmHXAp5YzQxU9IAZVY7AHKvX6ulwcriNrf6cq8sSoOV4yAfQr5cXSKpSyZAroS8AtmuukZSl8xpd4/v8s9O0mBlSTsEcpF3j6XBygnt/fjI6hJJXTKpLS5/CbJNdY2kLtkI8lA7ljmjukZSl+zS7h6f520nabDytXa++rDqEkldMhlyLWQV5DPVNZK6ZNO2vPxXkPWqayR1yR5tJcyK6hJJ3XJ6G+vjpkVpmLIO5HrISsiW1TWSumQe5CnILZBp1TWSumSftsD8rOoSSd2yrG2TcGWqNEyZCrkJ8gxki+qa/0deMdN/QRYAjwL/APaAsdW1PZI65ID2Z6czqkskdcvZ7UPXXtUlkrpkXcitkCcgm1fXSOqShZDnID+HTKmukdQlh7Rri6dUl0jqknHIdyFvu21RGqxs1FbCPADZpLpGUpd8AbIGcoFrYaRBynj7D7xm9B9a0gBlk/ZIfe/oEVvSAOXAdgjkXLdJSIOUCZCL29fqvatrJHXJZm0I/R9GC9skDVAObnePv+UQemmQMhFyGWT1aGqmpAHKfMijkDsgs6trJHXJF9vd47N8rJYGKZMgV7Yl5rtW10jqkgXt3vHtkA2qayR1yZHtyuKy6hJJXTIZcjXkNchO1TWSumQh5Om2aXH96hpJXXJMe6w+vbpEUpesA7kG8grk89U1krpkEeRZyC8g06trJHXJcZAPHbAnDVamQq6DvARZXF0jqUu2hvy1LWlbr7pGUpd8tT1Wn1RdIqlLpkGuhzwP2ba6Rp9Mznn6xMu2wFPAGLATjK0qDpL00eUr7bH6hOoSSV0yoV1ZfNnHammwMhvyR8hdkJnVNZK6ZDHkdcj3nFstDVaObe/Hx1SXSOqSccjl7e7x9tU1krpkA8jdba/TrOoaSV3yOcirkO+7LlUarBzV3o+Pqy6R1CXjbUHbG5Adq2skdcn6kDsh90M2rK6R1CXbtLvHV432PEkaoBzu+Wpp0DIGOb9tWnR2tTRMmd7WwTwEmVtdI6lLFrUhAj8dLWyTNEA5tA2hP7m6RFKXjEHOhbztylRpsLJe2+v0KGST6hpJXbIQ8hzk2tHmRUkDlIMg70NOrS6R1C3nQNZAllaXSOqSaW1B25OQzatrJHXJgrZt8XrIlOoaSV2yH+Q9yBnVJZK6ZTnkXcje1SWSumQq5EbIM5D51TXSf+Js5n8r84FHGO13WgpjbxUHSfrocmh7P15eXSKpS6ZAroa86d+PpcHKZyErIbe630karJzWxvqcWF0iqUtmQ34DeXo0UEDSAGXfdvf4Cm87SYOUiZBLIO9ADqiukdQlW0KegNwBmVNdI6lLjm8fuc6sLpHUJdMhN0FWuftYGqzs0lamXjM6Zy1pYDIO+XYb6XNEdY2kLtkc8kDbtLhZdY2kLjmy/RqvGP06SxqYTIX8rL0f71xdI6lLtm97nW4cfbmWNED5Rvvb8fHVJZK6ZA7kTsjjo1NdkgYoB7bh8xePzllLGphMhnwf8hZkn+oaSV2yqE3IvA0yq7pGUpec1D5ynVJdIqlLZrZ5XCsh21TXSOqSpW1C5o/c6SQNUiZALmirYA6prpHUJXPaBYl7IBtX10jqkh0gb0AugoxV10jqki+1r9VHV5dI6pLxNiXzdcf5SIOVGW1C5v2QDatrJHXJIsgLkB97tloarBwM+QBycnWJpG45p22R2K26RFKXTIXcDHnSAXvS2rUWB99lHvAI8HdgLxhbs/b+LUlrSZa2IQLLq0skdcspbeStWxalYcqkNvJ2FWRhdY2kLpkDeQhyuyNvpcHKju3+8YVelJAGK8e2ixJHVZdI6pJxyKWQ1yDbVddI6pL12xD6+yCzq2skdclWkBfbfC4vSkjDlEPbRYkTq0skdcuKdlFi1+oSSV0yDXIL5AnIptU1krpkPuQvkBucXy0NVvZq86uXVZdI6pbT2kWJ/atLJHX510WJ51xiLg1W5kIebmtT16uukdQlSyCrIed7UUIatDwIObK6QtLH80+9oOFnWN7TiwAAAABJRU5ErkJggg==)166 J. Wassenberg and P. Sanders

Speedup vs. single thread

![](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAPkAAADNCAYAAAB6pNYrAAAKN0lEQVR4nO3de2ydZQHH8e/b05azdt06VrZVNigSroqXaIjBTAhGRePtDzUaQ0QNBAzBTEKMBsmTaKIhagJEJ8olBNCIQNQ4LzEYXKIQxSgREJlDx8C5S0tZ19L2XB7/aKentRudZec559n3k/DHeXnZ+WXdl5692d634L8G1q9f/8SLL764ouEYMcbayMjI0zRRV1fXshhjrVqtTjfzfRdSLpdXTE5O7k+9A6C7u7t3enp6PPUOvz4L6+7u7pmenp5o5nv29/efXCqVuuftmNq1a9fZwHPzzz9/cHBwtGnrDiOE8K0QwuWpdwCEEFri5wQghPDz1BsAQgh3hBA+lnoHQAhhKvWGg0II96feADA0NLQT+NDB1x0Jt0hqAiOXMmfkUuaMXMpcZ+OLF1544ZlUQ+a5BWiVC16fTD2gwZdTD5j1TWBP6hGzLk49oMHXUw8AGB4e3naof3c+8OsmbpF0dNyDV9elY4eRS8cQP65LbS2eBdW/wngNpipQuQdid+dL/4eSWl/sgtpWeHY17CugowNOeTf03ejHdSkP58NU90zgAHXgmWVQXGzkUh4O1XKHkUt5eBDKNVg1+7IANkxC/H7jSV54k9pafB1Un4UXqjA5DZWfQuxpPMPIpTz4h2GkY4mRS5kzcilzRi5lzsilzBm5lDkjlzJn5FLmjFzKnJFLmTNyKXNGLmVuzp1h1q5de/YVV1zx7LxzRkIIr2niJkmLFEL4FXB647E777yze/v27fcudL5/C03Kg38LTTqWGLmUOSOXMmfkUuaMXMqckUuZM3Ipc0YuLVk8FyrfhcovIX4CYkt15bPQpCWJF0HtPti1DCoFrDsPut8OfDj1soNa6v84Uvup3gRP98CeAp4HnuwB3gvx7NTLDjJyaUk6huBAw+sITFSAM9Ls+V9GLi1J/UlY2fC6BPR2A48mGvQ/jFxaks7L4ORxWF+BNcBZ48DNUDydetlBXniTlqR4COJr4YTLIK6G0n1Q/Cz1qkZGLi1ZsR34bOoVh+LHdSlzRi5lzsilzBm5lDkjlzJn5FLmjFzKnJFLmTNyKXNGLmXOyKXMzfmz6729vauvueaaS+edMxlCuLOJmyQtUgjhA8CqxmM33njj+pGRkf+8nhN5V1fXMuDceT/OGGDkUms6B3hF44FyubzyEOf6wEMpEz7wUDqWGLmUOSOXMmfkUuaMXG0qngWVLVD5J1R+MXOfNS3Ee7ypDcUNUHsYdvXB/gL6BuHE8yC+Hoq/pV7XavxOrjZU/xQMl2eeWjIJ7AX2lKH2mdTLWpGRqw3VT4EXu+cem+yEeEqaPa3NyNWGOrfACeNQNBwbmIDSj5NNamFGrnb0PThuK7xqHDZMw6sOQM/vobgt9bBW5IU3taGiCrwL4gWw5rXA48ADUMS0u1qTkauNFQ8CDyYe0fL8uC5lzsilzBm5lDkjlzJn5FLmjFzKnJFLmTNyKXNGLmXOyKXMGbmUOSOXMmfkUuaMXMrcnL9qevzxxw9dddVV9887Z38I4ZLmTZK0WCGEG4ANjcduvfXWc3bu3HnvQuf7LDQpDz4LTTqWGLmUOSOXMmfkOkKxDPGtEM9NvUSL440cdQTiRqhvgckIpQ6oPgel86HYnXqZDs3ItUjxOKj/BP7WB2Ozx048FQbuAt6WcpkOz4/rWqzzZp47NtZwaFcnlC5MNUiLY+RarDHonPfrpQTE6SRrtGhGrkUqHoHSTjixOvPLpgs4ZQLizamX6fCMXEegdAEMPACvq8OrJ6Hn21C6OvUqHZ4X3nQEij3ARalX6Mj4nVzKnJFLmTNyKXNGLmXOyKXMGbmUOSOXMmfkUuaMXMqckUuZM3Ipc0YuZc7IpcwZuZQ5I5cyZ+RS5ubcNGLdunXnXH755RPzztkXQjipiZskLVII4Q/AWY3H7r777qlt27b5wEMpY3MeeOjtn9pCXAa8HxgAfgHFU4kHqY34e/KWF4eg9ncYuxn2XQ+1R6HmzRO1aEbe8qqbYfcJ8FQf7CjD42WIX4K4PvUytQcjb3kdG2Fvw9epAoxVgLekWqT2YuQtr74XyvOOlQvguRRr1H6MvOV1XAuvnICVwDJgaBq6tgNbEw9Tm/Dqessr3Q1xEoauA1ZD8SMofR6KmHqZ2oORt4XiPuC+1CvUnvy4LmXOyKXMGbmUOSOXMmfkUuaMXMqckUuZM3Ipc0YuZc7IpcwZuZQ5I5cyZ+RS5oxcypyRS5kzcilzRi5lzsilzBm5lLk593gbGBg49corr5z/PLTREML7mrhJ0iKFEO4AhhqP3X777Sfv2LHDBx6+tFiG+HGo3gr1T0NckXqRtEg+8PClxTLUfgMTZ8LzPbBiAvo2QXwDFMOp10lHwt+TL+wjMHEGPNUDe4HtPTA6CHUfNKi2Y+QLqr0Znu+de+z5bqhtTLNH+v8Z+YI6/jzzEb1RXwU6/pRmj/T/M/IFFbdD396Z546tBNZXYGAMStenXiYdKS+8LajYP3ORbdXVsGLjzHfw0vVQ7Ey9TDpSRn5IxTDw+dQrpKXy47qUOSOXMmfkUuaMXMqckUuZM3Ipc0YuZc7IpcwZuZQ5I5cyZ+RS5oxcypyRS5kzcilzRi5lzsilzBm5lLkWvDNMvABqHwXGoXQLFI+lXiS1sxaLvPo5qFwLu3ugVIM1l0L8IBQ/Tb1MaldzIl+5cuXgpk2bvjLvnPEQwheP/pS4GurXwV/KUAEowVgPnPod4MSj//5S+wkhbALWNh7bvHnz6bt37/7P6zmR1+v1GjA678eZd//xo+ZMmJyCSvm/h8aAYk2T3l9qR2PAcY0H6vV69VAnJ37gYVwH1Un4Y4RHZv95IkJ1X7pNUlua88DDFrq6XvwL2AxnjsNq4ATgtAkoPp14mNTWWuzCW+cmiA/B+kuAcei8CYqtqVdJ7azFIgco7mHm44akl0ELfVyXdDQYuZQ5I5cyZ+RS5oxcypyRS5kzcilzRi5lzsilzBm5lDkjlzJn5FLmjFzKnJFLmTNyKXNGLmXOyKXMGbmUuTmRDw4OvibVkEYhhG+EEC5LvQMghNAyd4sNIWxJvQEghHBbCOHi1DsAQgjjqTccFEL4QeoNAENDQ29qfD3/O3nRxC2HU6J1PmW00n3wSqkHzPLrs7BW+frM6Xj2CxV74IdvP3DgU8dBbInv5pJeHp0QN0Dtd7Bx5djYG8tQfRiqX4POL6QeJ2npOqB2A+xZA39fBv8CHl8GXA3xtNTjJC1dAdUReGIVTDccXsfy5V+tlst3Vev1em1kZGR7M0f19vaurlar01NTU2PNfN+F9Pf3nzQ6OvpM6h0AfX19a8fGxna/9JlHV29v70C1Wp2cmpo6kHpLf3//yaOjoztS7wBYvnz5mgMHDuxp5nuuWrVqqFQqdTceizEWw8PDFwK/nT1UeQy2NTx/7JEIE/shvqOZYyUdNfE9UJmA7bMPGNw3BdXHIbbK1VNJSxffCZWHofIPqN4EsT/1Ikkvj38DFCVO28tb4+0AAAAASUVORK5CYII=)

8

7

6

5

4

3

2

1

2 3 4 5 6 7 8

1

Number of threads

**Fig.** **1.** Linear scalability on two quad-core CPUs with a NUMA factor of 1.5

Since radix sort is bandwidth-sensitive, it is also interesting to examine per- formance for a varying number of processors. We manually distribute OpenMP threads across CPU packages and cores (in that order) to make use of all avail- able memory controllers. Our NUMA-aware implementation scales linearly with the number of threads, as shown by Figure [1](#_bookmark23).

To explain the 95% parallel eﬃciency, we measured the total traﬃc at each socket’s memory controller. Since this information is not available from current proﬁlers such as VTune (which use per-core performance counters), we have developed a small kernel-mode driver to provide access to the model-speciﬁc performance counters in the Intel i7 uncore[4](#_bookmark24). Uncached writes constitute the bulk of the write combiners’ memory traﬃc and are therefore of particular interest. They are apparently reported as Invalid-To-Exclusive transitions and can thus be counted as the total number of reads minus ‘normal’ reads [[12].](#_bookmark25) We ﬁnd that 2041 MiB are written, which corresponds to 64 Mi items · 8 bytes per item · 4 passes (slightly less because our ﬁnal pass cannot use non-temporal writes when the output position is not aligned). Surprisingly, 2272 MiB are read – about 10% more than expected. This amount seems to be inﬂuenced by the number of threads. Possible causes may include coherency traﬃc or page walks and will be investigated in future work. However, we can provide a conservative estimate of the bandwidth utilization. Given the pure read and write bandwidths (38687 MB/s and 28200 MB/s) measured by RightMark [[13],](#_bookmark26) the minimum time required for 4 reads and writes of 175 M 8-byte items is 343 ms, which is 89% of the total measured time. This calculation does not include write-to- read turnaround [[14](#_bookmark27), p. 486], so there is even less room for improvement than indicated.

4 The part of the socket not associated with a particular core.

Engineering a Multi-core Radix Sort 167

Nanoseconds per item

1.7

1.68

1.66

1.64

1.62

1.6

1.58

1.56

1.54

1.52

1.5

|  |
| --- |
| Uniform Gaussian |

50 100 150 200 250 300

Number of 32-bit items [ ·220 ]

**Fig.** **2.** Time per item for various input sizes and distributions

The previous measurements concern large numbers of items. We now study performance over a wider range of input sizes. The elapsed time per item, shown in Figure [2](#_bookmark28), varies inversely with the number of items N due to amortization of thread-startup overhead. Performance is within 10% of the best measurement when N ≥ 26 · 220 , or N ≥ 21 · 220 in the case of the approximated Gaussian distribution [[15](#_bookmark29)]. It is initially surprising that this distribution does not require more time to sort than uniformly distributed numbers. However, interleaving buckets in the LSD passes (successive buckets are assigned to diﬀerent threads) avoids load imbalance, and increased occupancy of the central buckets improves locality at the memory page level.

**6** **Conclusion**

We have introduced improvements to counting sort and a novel variant of radix sort for integer key/value pairs. Bandwidth measurements indicate our algo- rithm’s throughput is within 11% of the theoretical optimum for the given hard- ware. It outperforms the recently published results of Intel’s radix sort by a factor of 1.64 and also outpaces a Fermi GPU when data transfer overhead is included.

168 J. Wassenberg and P. Sanders

These results indicate that scalar, bandwidth-sensitive sorting algorithms still have their place on current architectures. However, achieving this level of perfor- mance requires awareness of the underlying microarchitecture and some degree of tuning. Our implementation encompasses 5700 lines of C++ (including tests), plus 40,000 lines of shared infrastructure. A demo executable [[16](#_bookmark30)] capable of gen- erating or reading 32-bit integers, sorting and eﬃciently writing them to disk is being made available so that our measurements may be reproduced.

Future Work: While carefully engineered, our implementation is not yet a general solution for all possible sorting applications. Radix sort is limited to relatively small integer keys, and we also assume at least one of the key digits (the MSB) is reasonably equally distributed. Skewed (e.g. constant) distributions currently re- sult in load imbalance. This could be avoided by sorting extremely large buckets

from the MSD phase using multiple processors.

We are also interested in testing on larger multi-socket machines with higher NUMA factors and investigating details of the memory subsystem that reduce eﬀective bandwidth. Finally, we believe the general software write-combining technique can provide similar speedups for other memory-intensive applications. In particular, comparison-based sample sort is also expected to beneﬁt from our implementation techniques.

**References**

1. Bohannon, P., McIlroy, P., Rastogi, R.: Main-memory index structures with ﬁxed- size partial keys. In: SIGMOD Conference, pp. 163–174 (2001), [http://www.acm.org/sigs/sigmod/sigmod01/eproceedings/papers/](http://www.acm.org/sigs/sigmod/sigmod01/eproceedings/papers/Research-Bohannon-et-al.pdf) [Research-Bohannon-et-al.pdf](http://www.acm.org/sigs/sigmod/sigmod01/eproceedings/papers/Research-Bohannon-et-al.pdf)

2. Satish, N., Kim, C., Chhugani, J., Nguyen, A., Lee, V., Kim, D., Dubey, P.: Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort. In: Elma- garmid, A., Agrawal, D. (eds.) SIGMOD Conference, pp. 351–362. ACM Press, New York (2010), <http://doi.acm.org/10.1145/1807167.1807207>

3. Mehlhorn, Sanders: Scanning multiple sequences via cache memory. Algorith- mica 35 (2003)

4. Intel. Intel Architecture Software Developer Manual (2010), System Programming Guide, <http://www.intel.com/Assets/PDF/manual/253668.pdf>

5. Intel Corporation. Intel 64 and IA-32 Architectures Optimization Reference Man- ual (November 2007), <http://www.intel.com/design/processor/manuals/248966.pdf>

6. Wassenberg, J., Middelmann, W., Sanders, P.: An eﬃcient parallel algorithm for graph-based image segmentation (June 2009), [http://algo2.iti.uni-karlsruhe.de/wassenberg/](http://algo2.iti.uni-karlsruhe.de/wassenberg/wassenberg09parallelSegmentation.pdf) [wassenberg09parallelSegmentation.pdf](http://algo2.iti.uni-karlsruhe.de/wassenberg/wassenberg09parallelSegmentation.pdf)

7. Jimenez-Gonzalez, D., Navarro, J., Larriba-Pey, J.: Fast parallel in-memory 64-bit sorting. In: Proceedings of the 2001 International Conference on Supercomputing (15th ICS 2001), Sorrento, Napoli, Italy, pp. 114–122. ACM, New York (2001)

Engineering a Multi-core Radix Sort 169

8. an Mey, D., Terboven, C.: Aﬃnity matters! OpenMP on multicore and ccNUMA architectures. In: Parallel Computing: Architectures, Algorithms and Applications, vol. 15, Forschungszentrum J![](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAgAAAANCAYAAACUwi84AAAAg0lEQVQYla3OPQ4BURQF4O+RkBidBJXoJioVtUVorMSaVJZgExKxDFGZwmjuSyZjCoXTnJOb83P5ESPUmOOEY5dpETxBkY+94DWWSBhig2k2JRzwxAAlbtjnhhpnVBG449qeEMYvNA1VQ8+y6Ae/scULK+zil0tqNY6jqYjQo2v2z/gAdmISsqqr1u0AAAAASUVORK5CYII=)lich and RWTH Aachen University ( Febuary 2008), [http://www.compunity.org/events/pastevents/parco07/](http://www.compunity.org/events/pastevents/parco07/AffinityMatters_DaM.pdf) [AffinityMatters ![](data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL/2wBDAQkJCQwLDBgNDRgyIRwhMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjL/wAARCAABAAUDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QAHwEAAwEBAQEBAQEBAQAAAAAAAAECAwQFBgcICQoL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSExBhJBUQdhcRMiMoEIFEKRobHBCSMzUvAVYnLRChYkNOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6goOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4uPk5ebn6Onq8vP09fb3+Pn6/9oADAMBAAIRAxEAPwD5/ooooA//2Q==) DaM.pdf](http://www.compunity.org/events/pastevents/parco07/AffinityMatters_DaM.pdf)

9. Panneton, F., L’Ecuyer, P., Matsumoto, M.: Improved long-period generators based on linear recurrences modulo 2. ACM Transactions on Mathematical Software 32

(2006)

10. Satish, N., Kim, C., Chhugani, J., Nguyen, A., Lee, V., Kim, D., Dubey, P.: Fast sort on CPUs, GPUs and intel MIC architectures. Technical report, Intel (2010), [http://techresearch.intel.com/userfiles/en-us/](http://techresearch.intel.com/userfiles/en-us/FASTsort_CPUsGPUs_IntelMICarchitectures.pdf) [FASTsort ![](data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL/2wBDAQkJCQwLDBgNDRgyIRwhMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjL/wAARCAACAAUDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QAHwEAAwEBAQEBAQEBAQAAAAAAAAECAwQFBgcICQoL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSExBhJBUQdhcRMiMoEIFEKRobHBCSMzUvAVYnLRChYkNOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6goOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4uPk5ebn6Onq8vP09fb3+Pn6/9oADAMBAAIRAxEAPwD5/ooooA//2Q==) CPUsGPUs ![](data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL/2wBDAQkJCQwLDBgNDRgyIRwhMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjL/wAARCAACAAUDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QAHwEAAwEBAQEBAQEBAQAAAAAAAAECAwQFBgcICQoL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSExBhJBUQdhcRMiMoEIFEKRobHBCSMzUvAVYnLRChYkNOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6goOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4uPk5ebn6Onq8vP09fb3+Pn6/9oADAMBAAIRAxEAPwD5/ooooA//2Q==) IntelMICarchitectures.pdf](http://techresearch.intel.com/userfiles/en-us/FASTsort_CPUsGPUs_IntelMICarchitectures.pdf)

11. Merrill, D., Grimshaw, A.: Revisiting sorting for GPGPU stream ar- chitectures. Technical Report 3, University of Virginia (February 2010), <http://www.cs.virginia.edu/~dgm4d/papers/RadixSortTR.pdf>

12. Levinthal, D.: Performance Analysis Guide for Intel Core i7 Processor and Intel Xeon 5500 processors. Intel, [http://software.intel.com/sites/products/collateral/hpc/vtune/](http://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf) [performance ![](data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL/2wBDAQkJCQwLDBgNDRgyIRwhMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjL/wAARCAACAAUDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QAHwEAAwEBAQEBAQEBAQAAAAAAAAECAwQFBgcICQoL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSExBhJBUQdhcRMiMoEIFEKRobHBCSMzUvAVYnLRChYkNOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6goOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4uPk5ebn6Onq8vP09fb3+Pn6/9oADAMBAAIRAxEAPwD5/ooooA//2Q==) analysis ![](data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL/2wBDAQkJCQwLDBgNDRgyIRwhMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjL/wAARCAACAAUDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QAHwEAAwEBAQEBAQEBAQAAAAAAAAECAwQFBgcICQoL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSExBhJBUQdhcRMiMoEIFEKRobHBCSMzUvAVYnLRChYkNOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6goOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4uPk5ebn6Onq8vP09fb3+Pn6/9oADAMBAAIRAxEAPwD5/ooooA//2Q==) guide.pdf](http://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf)

13. Besedin, D.: RightMark memory analyzer, <http://cpu.rightmark.org> (accessed January 9, 2009)

14. Jacob, B., Ng, S., Wang, D.: Memory systems: cache, DRAM, disk. Morgan Kauf- mann, San Francisco (2007)

15. Helman, D., Bader, D., J![](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAYAAAANCAYAAACKCx+LAAAAf0lEQVQYlZXPMQ4BURSF4U/eoFEMsQGSmVokorODsQAbsBRWYA0WoJ7eCkarsAAtgua9RjX+5Oaem5OTnEsLRnhh/WvssUhHiLtABxfck7nBAUPUWKXEGzdcMcUSpwwPbDFBP46AM3Y4Yo4xngElZvggR4UmFehhEHW3zdN/8AVi2RFQolxXUQAAAABJRU5ErkJggg==)J![](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAcAAAANCAYAAABlyXS1AAAAhklEQVQYla3QPw4BYRQE8N9af5otcAMSao1s5wIbDuACruIECoVTiHJPoF+tQqXaVgTNJ/kKEoVJXt5k5r0phh/Rxx2LT+Ya01hIwx4hwQl1fLDEBj2UmMWfD1xwxhA59tDEDSsM0AkDGjhghy2uQS+QpBhjgie6mKPC8Z3QRhZ460sP/8IL2JESBw3NkWIAAAAASUVORK5CYII=), J.: A randomized parallel sorting algorithm with an experimental study. J. Parallel Distrib. Comput. 52(1), 1–23 (1998)

16. Wassenberg, J.: Vmcsort demo (May 2011), <http://algo2.iti.kit.edu/wassenberg/vmcsort/demo.html>